Search | VHL Regional Portal

1.

The effect of data transformation on low-dimensional integration of single-cell RNA-seq.

Park, Youngjun; Hauschild, Anne-Christin.

BMC Bioinformatics ; 25(1): 171, 2024 Apr 30.

Article in English | MEDLINE | ID: mdl-38689234

ABSTRACT

BACKGROUND: Recent developments in single-cell RNA sequencing have opened up a multitude of possibilities to study tissues at the level of cellular populations. However, the heterogeneity in single-cell sequencing data necessitates appropriate procedures to adjust for technological limitations and various sources of noise when integrating datasets from different studies. While many analysis procedures employ various preprocessing steps, they often overlook the importance of selecting and optimizing the employed data transformation methods. RESULTS: This work investigates data transformation approaches used in single-cell clustering analysis tools and their effects on batch integration analysis. In particular, we compare 16 transformations and their impact on the low-dimensional representations, aiming to reduce the batch effect and integrate multiple single-cell sequencing datasets. Our results show that data transformations strongly influence the results of single-cell clustering on low-dimensional data space, such as those generated by UMAP or PCA. Moreover, these changes in low-dimensional space significantly affect trajectory analysis using multiple datasets, as well. However, the performance of the data transformations greatly varies across datasets, and the optimal method was different for each dataset. Additionally, we explored how data transformation impacts the analysis of deep feature encodings using deep neural network-based models, including autoencoder-based models and prototypical networks. Data transformation also strongly affects the outcome of deep neural network models. CONCLUSIONS: Our findings suggest that the batch effect and noise in integrative analysis are highly influenced by data transformation. Low-dimensional features can integrate different batches well when proper data transformation is applied. Furthermore, we found that the batch mixing score on low-dimensional space can guide the selection of the optimal data transformation. In conclusion, data preprocessing is one of the most crucial analysis steps and needs to be cautiously considered in the integrative analysis of multiple scRNA-seq datasets.
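To make the notion of a data transformation concrete, the following minimal sketch applies one widely used choice, library-size normalization followed by log1p, to a toy count matrix. It is an illustration only and is not claimed to be one of the 16 transformations as implemented by the authors; the function name `log_cpm`, the scale factor, and the toy counts are assumptions.

```python
import math

def log_cpm(counts, scale=1e6):
    """Scale each cell to `scale` total counts, then apply log(1 + x)."""
    transformed = []
    for cell in counts:
        total = sum(cell)
        if total == 0:
            # An empty cell stays all-zero instead of dividing by zero.
            transformed.append([0.0] * len(cell))
            continue
        transformed.append([math.log1p(c / total * scale) for c in cell])
    return transformed

# Two hypothetical cells with the same expression profile,
# sequenced at a 10-fold different depth.
toy_counts = [
    [10, 0, 90],
    [1, 0, 9],
]
result = log_cpm(toy_counts)
```

After the transformation both toy cells have the same profile, which shows how a transformation can remove a purely technical difference (sequencing depth) before dimensionality reduction.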

Subject(s)

RNA-Seq , Single-Cell Analysis , Single-Cell Analysis/methods , RNA-Seq/methods , Cluster Analysis , Humans , Sequence Analysis, RNA/methods , Algorithms , Neural Networks, Computer , Single-Cell Gene Expression Analysis

2.

CLARUS: An interactive explainable AI platform for manual counterfactuals in graph neural networks.

Metsch, Jacqueline Michelle; Saranti, Anna; Angerschmid, Alessa; Pfeifer, Bastian; Klemt, Vanessa; Holzinger, Andreas; Hauschild, Anne-Christin.

J Biomed Inform ; 150: 104600, 2024 02.

Article in English | MEDLINE | ID: mdl-38301750

ABSTRACT

BACKGROUND: Lack of trust in artificial intelligence (AI) models in medicine is still the key blockage for the use of AI in clinical decision support systems (CDSS). Although AI models are already performing excellently in systems medicine, their black-box nature entails that patient-specific decisions are incomprehensible for the physician. Explainable AI (XAI) algorithms aim to "explain" to a human domain expert which input features influenced a specific recommendation. However, in the clinical domain, these explanations must lead to some degree of causal understanding by a clinician. RESULTS: We developed the CLARUS platform, aiming to promote human understanding of graph neural network (GNN) predictions. CLARUS enables the visualisation of patient-specific networks, as well as relevance values for genes and interactions, computed by XAI methods such as GNNExplainer. This enables domain experts to gain deeper insights into the network and, more importantly, the expert can interactively alter the patient-specific network based on the acquired understanding and initiate re-prediction or retraining. This interactivity allows us to ask manual counterfactual questions and analyse the effects on the GNN prediction. CONCLUSION: We present the first interactive XAI platform prototype, CLARUS, that allows not only the evaluation of specific human counterfactual questions based on user-defined alterations of patient networks and a re-prediction of the clinical outcome but also a retraining of the entire GNN after changing the underlying graph structures. The platform is currently hosted by the GWDG on https://rshiny.gwdg.de/apps/clarus/.

Subject(s)

Decision Support Systems, Clinical , Physicians , Humans , Artificial Intelligence , Neural Networks, Computer , Algorithms , Tolnaftate

3.

Species-agnostic transfer learning for cross-species transcriptomics data integration without gene orthology.

Park, Youngjun; Muttray, Nils P; Hauschild, Anne-Christin.

Brief Bioinform ; 25(2), 2024 Jan 22.

Article in English | MEDLINE | ID: mdl-38305455

ABSTRACT

Model organisms such as mice and zebrafish play a crucial role in biomedical research, as novel hypotheses are often developed or validated in them. However, due to biological differences between species, translating these findings into human applications remains challenging. Moreover, commonly used orthologous gene information is often incomplete and entails a significant information loss during gene-id conversion. To address these issues, we present a novel methodology for species-agnostic transfer learning with heterogeneous domain adaptation. We extended the cross-domain structure-preserving projection toward out-of-sample prediction. Our approach not only allows knowledge integration and translation across various species without relying on gene orthology but also identifies similar Gene Ontology (GO) terms among the most influential genes composing the latent space for integration. Subsequently, during the alignment of latent spaces, each composed of species-specific genes, it is possible to identify functional annotations of genes missing from public orthology databases. We evaluated our approach with four different single-cell sequencing datasets focusing on cell-type prediction and compared it against related machine-learning approaches. In summary, the developed model outperforms related methods working without prior knowledge when predicting unseen cell types based on other species' data. The results demonstrate that our novel approach allows knowledge transfer beyond species barriers without depending on known gene orthology, instead utilizing the entire gene sets.

Subject(s)

Algorithms , Zebrafish , Mice , Humans , Animals , Zebrafish/genetics , Gene Expression Profiling , Species Specificity , Machine Learning

4.

A primer on the use of machine learning to distil knowledge from data in biological psychiatry.

Quinn, Thomas P; Hess, Jonathan L; Marshe, Victoria S; Barnett, Michelle M; Hauschild, Anne-Christin; Maciukiewicz, Malgorzata; Elsheikh, Samar S M; Men, Xiaoyu; Schwarz, Emanuel; Trakadis, Yannis J; Breen, Michael S; Barnett, Eric J; Zhang-James, Yanli; Ahsen, Mehmet Eren; Cao, Han; Chen, Junfang; Hou, Jiahui; Salekin, Asif; Lin, Ping-I; Nicodemus, Kristin K; Meyer-Lindenberg, Andreas; Bichindaritz, Isabelle; Faraone, Stephen V; Cairns, Murray J; Pandey, Gaurav; Müller, Daniel J; Glatt, Stephen J.

Mol Psychiatry ; 2024 Jan 04.

Article in English | MEDLINE | ID: mdl-38177352

ABSTRACT

Applications of machine learning in the biomedical sciences are growing rapidly. This growth has been spurred by diverse cross-institutional and interdisciplinary collaborations, public availability of large datasets, an increase in the accessibility of analytic routines, and the availability of powerful computing resources. With this increased access and exposure to machine learning comes a responsibility for education and a deeper understanding of its bases and bounds, borne equally by data scientists seeking to ply their analytic wares in medical research and by biomedical scientists seeking to harness such methods to glean knowledge from data. This article provides an accessible and critical review of machine learning for a biomedically informed audience, as well as its applications in psychiatry. The review covers definitions and expositions of commonly used machine learning methods, and historical trends of their use in psychiatry. We also provide a set of standards, namely Guidelines for REporting Machine Learning Investigations in Neuropsychiatry (GREMLIN), for designing and reporting studies that use machine learning as a primary data-analysis approach. Lastly, we propose the establishment of the Machine Learning in Psychiatry (MLPsych) Consortium, enumerate its objectives, and identify areas of opportunity for future applications of machine learning in biological psychiatry. This review serves as a cautiously optimistic primer on machine learning for those on the precipice as they prepare to dive into the field, either as methodological practitioners or well-informed consumers.

5.

Ensemble-GNN: federated ensemble learning with graph neural networks for disease module discovery and classification.

Pfeifer, Bastian; Chereda, Hryhorii; Martin, Roman; Saranti, Anna; Clemens, Sandra; Hauschild, Anne-Christin; Beißbarth, Tim; Holzinger, Andreas; Heider, Dominik.

Bioinformatics ; 39(11), 2023 11 01.

Article in English | MEDLINE | ID: mdl-37988152

ABSTRACT

SUMMARY: Federated learning enables collaboration in medicine, where data is scattered across multiple centers, without the need to aggregate the data in a central cloud. While, in general, machine learning models can be applied to a wide range of data types, graph neural networks (GNNs) are specifically developed for graphs, which are very common in the biomedical domain. For instance, a patient can be represented by a protein-protein interaction (PPI) network where the nodes contain the patient-specific omics features. Here, we present our Ensemble-GNN software package, which can be used to deploy federated, ensemble-based GNNs in Python. Ensemble-GNN allows users to quickly build predictive models utilizing PPI networks consisting of various node features such as gene expression and/or DNA methylation. As an example, we show the results from a public dataset of 981 patients and 8469 genes from the Cancer Genome Atlas (TCGA). AVAILABILITY AND IMPLEMENTATION: The source code is available at https://github.com/pievos101/Ensemble-GNN, and the data at Zenodo (DOI: 10.5281/zenodo.8305122).

Subject(s)

DNA Methylation , Machine Learning , Humans , Neural Networks, Computer , Protein Interaction Maps , Software

6.

The FeatureCloud Platform for Federated Learning in Biomedicine: Unified Approach.

Matschinske, Julian; Späth, Julian; Bakhtiari, Mohammad; Probul, Niklas; Kazemi Majdabadi, Mohammad Mahdi; Nasirigerdeh, Reza; Torkzadehmahani, Reihaneh; Hartebrodt, Anne; Orban, Balazs-Attila; Fejér, Sándor-József; Zolotareva, Olga; Das, Supratim; Baumbach, Linda; Pauling, Josch K; Tomasevic, Olivera; Bihari, Béla; Bloice, Marcus; Donner, Nina C; Fdhila, Walid; Frisch, Tobias; Hauschild, Anne-Christin; Heider, Dominik; Holzinger, Andreas; Hötzendorfer, Walter; Hospes, Jan; Kacprowski, Tim; Kastelitz, Markus; List, Markus; Mayer, Rudolf; Moga, Mónika; Müller, Heimo; Pustozerova, Anastasia; Röttger, Richard; Saak, Christina C; Saranti, Anna; Schmidt, Harald H H W; Tschohl, Christof; Wenke, Nina K; Baumbach, Jan.

J Med Internet Res ; 25: e42621, 2023 07 12.

Article in English | MEDLINE | ID: mdl-37436815

ABSTRACT

BACKGROUND: Machine learning and artificial intelligence have shown promising results in many areas and are driven by the increasing amount of available data. However, these data are often distributed across different institutions and cannot be easily shared owing to strict privacy regulations. Federated learning (FL) allows the training of distributed machine learning models without sharing sensitive data. However, its implementation is time-consuming and requires advanced programming skills and complex technical infrastructures. OBJECTIVE: Various tools and frameworks have been developed to simplify the development of FL algorithms and provide the necessary technical infrastructure. Although there are many high-quality frameworks, most focus only on a single application case or method. To our knowledge, there are no generic frameworks, meaning that the existing solutions are restricted to a particular type of algorithm or application field. Furthermore, most of these frameworks provide an application programming interface that needs programming knowledge. There is no collection of ready-to-use FL algorithms that are extendable and allow users (eg, researchers) without programming knowledge to apply FL. A central FL platform for both FL algorithm developers and users does not exist. This study aimed to address this gap and make FL available to everyone by developing FeatureCloud, an all-in-one platform for FL in biomedicine and beyond. METHODS: The FeatureCloud platform consists of 3 main components: a global frontend, a global backend, and a local controller. Our platform uses Docker to separate the locally acting components of the platform from the sensitive data systems. We evaluated our platform using 4 different algorithms on 5 data sets for both accuracy and runtime. RESULTS: FeatureCloud removes the complexity of distributed systems for developers and end users by providing a comprehensive platform for executing multi-institutional FL analyses and implementing FL algorithms. Through its integrated artificial intelligence store, federated algorithms can easily be published and reused by the community. To secure sensitive raw data, FeatureCloud supports privacy-enhancing technologies to secure the shared local models and assures high standards in data privacy to comply with the strict General Data Protection Regulation. Our evaluation shows that applications developed in FeatureCloud can produce highly similar results compared with centralized approaches and scale well for an increasing number of participating sites. CONCLUSIONS: FeatureCloud provides a ready-to-use platform that integrates the development and execution of FL algorithms while reducing the complexity to a minimum and removing the hurdles of federated infrastructure. Thus, we believe that it has the potential to greatly increase the accessibility of privacy-preserving and distributed data analyses in biomedicine and beyond.

Subject(s)

Algorithms , Artificial Intelligence , Humans , Health Occupations , Software , Computer Communication Networks , Privacy

7.

Analysis of a Deep Learning Model for 12-Lead ECG Classification Reveals Learned Features Similar to Diagnostic Criteria.

Bender, Theresa; Beinecke, Jacqueline M; Krefting, Dagmar; Muller, Carolin; Dathe, Henning; Seidler, Tim; Spicher, Nicolai; Hauschild, Anne-Christin.

IEEE J Biomed Health Inform ; PP, 2023 May 01.

Article in English | MEDLINE | ID: mdl-37126621

ABSTRACT

Despite their remarkable performance, deep neural networks remain unadopted in clinical practice, which is considered to be partially due to their lack of explainability. In this work, we apply explainable attribution methods to a pre-trained deep neural network for abnormality classification in 12-lead electrocardiography to open this "black box" and understand the relationship between model prediction and learned features. We classify data from two public databases (CPSC 2018, PTB-XL) and the attribution methods assign a "relevance score" to each sample of the classified signals. This allows analyzing what the network learned during training, for which we propose quantitative methods: average relevance scores over a) classes, b) leads, and c) average beats. The analyses of relevance scores for atrial fibrillation and left bundle branch block compared to healthy controls show that their mean values a) increase with higher classification probability and correspond to false classifications when around zero, and b) correspond to clinical recommendations regarding which lead to consider. Furthermore, c) visible P-waves and concordant T-waves result in clearly negative relevance scores in atrial fibrillation and left bundle branch block classification, respectively. Results are similar across both databases despite differences in study population and hardware. In summary, our analysis suggests that the DNN learned features similar to cardiology textbook knowledge.
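The averaging of relevance scores over leads and classes described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the dictionary layout (lead name mapped to per-sample scores), and the toy values are all hypothetical.

```python
from statistics import mean

def mean_relevance_per_lead(relevance):
    """Average the per-sample relevance scores of each lead.

    `relevance` maps a lead name to the list of relevance scores
    that an attribution method assigned to that lead's samples.
    """
    return {lead: mean(scores) for lead, scores in relevance.items()}

def mean_relevance_per_class(records):
    """Average relevance over all leads and samples, grouped by class label."""
    pooled = {}
    for label, relevance in records:
        for scores in relevance.values():
            pooled.setdefault(label, []).extend(scores)
    return {label: mean(values) for label, values in pooled.items()}

# Toy data: two recordings, two leads, four samples per lead (hypothetical values).
records = [
    ("AF", {"II": [0.5, 0.5, 1.0, 0.0], "V1": [0.25, 0.25, 0.25, 0.25]}),
    ("healthy", {"II": [-0.5, 0.0, 0.0, 0.5], "V1": [0.0, 0.0, 0.0, 0.0]}),
]
per_lead_af = mean_relevance_per_lead(records[0][1])
per_class = mean_relevance_per_class(records)
```

Comparing the per-lead means indicates which lead contributed most to a classification, which is the kind of quantity the abstract relates to clinical recommendations on which lead to consider.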

8.

MirDIP 5.2: tissue context annotation and novel microRNA curation.

Hauschild, Anne-Christin; Pastrello, Chiara; Ekaputeri, Gitta Kirana Anindya; Bethune-Waddell, Dylan; Abovsky, Mark; Ahmed, Zuhaib; Kotlyar, Max; Lu, Richard; Jurisica, Igor.

Nucleic Acids Res ; 51(D1): D217-D225, 2023 01 06.

Article in English | MEDLINE | ID: mdl-36453996

ABSTRACT

MirDIP is a well-established database that aggregates human microRNA-gene interactions from multiple databases to increase coverage, reduce bias, and improve usability by providing an integrated score proportional to the probability of the interaction occurring. In version 5.2, we removed eight outdated resources, added a new resource (miRNATIP), and ran five prediction algorithms for miRBase and mirGeneDB. In total, mirDIP 5.2 includes 46 364 047 predictions for 27 936 genes and 2734 microRNAs, making it the first database to provide interactions using data from mirGeneDB. Moreover, we curated and integrated 32 497 novel microRNAs from 14 publications to accelerate the use of these novel data. In this release, we also extend the content and functionality of mirDIP by associating contexts with microRNAs, genes, and microRNA-gene interactions. We collected and processed microRNA and gene expression data from 20 resources and acquired information on 330 tissue and disease contexts for 2657 microRNAs, 27 576 genes and 123 651 910 gene-microRNA-tissue interactions. Finally, we improved the usability of mirDIP by enabling the user to search the database using precursor IDs, and we integrated miRAnno, a network-based tool for identifying pathways linked to specific microRNAs. We also provide a mirDIP API to facilitate access to its integrated predictions. The updated mirDIP is available at https://ophid.utoronto.ca/mirDIP.

Subject(s)

MicroRNAs , Humans , Algorithms , Databases, Nucleic Acid , Epistasis, Genetic , MicroRNAs/genetics , MicroRNAs/metabolism , Molecular Sequence Annotation , Data Curation

9.

Guideline for software life cycle in health informatics.

Hauschild, Anne-Christin; Martin, Roman; Holst, Sabrina Celine; Wienbeck, Joachim; Heider, Dominik.

iScience ; 25(12): 105534, 2022 Dec 22.

Article in English | MEDLINE | ID: mdl-36437879

ABSTRACT

The long-lasting trend of medical informatics is to adapt novel technologies in the medical context. In particular, incorporating artificial intelligence to support clinical decision-making can significantly improve monitoring, diagnostics, and prognostics for the patient's and medic's sake. However, obstacles hinder a timely technology transfer from research to the clinic. Due to the pressure for novelty in the research context, projects rarely implement quality standards. Here, we propose a guideline for academic software life cycle processes tailored to the needs and capabilities of research organizations. While the complete implementation of a software life cycle according to commercial standards is not feasible in scientific work, we propose a subset of elements that we are convinced will provide a significant benefit while keeping the effort within a feasible range. Ultimately, the emerging quality checks for academic software development can pave the way for an accelerated deployment of academic advances in clinical practice.

10.

Editorial: Computational systems biomedicine.

Batra, Richa; Baloni, Priyanka; Alcaraz, Nicolas; Hauschild, Anne-Christin; Cervera, Alejandra.

Front Genet ; 13: 1047760, 2022.

Article in English | MEDLINE | ID: mdl-36313421

11.

Evaluation of machine learning strategies for imaging confirmed prostate cancer recurrence prediction on electronic health records.

Beinecke, Jacqueline Michelle; Anders, Patrick; Schurrat, Tino; Heider, Dominik; Luster, Markus; Librizzi, Damiano; Hauschild, Anne-Christin.

Comput Biol Med ; 143: 105263, 2022 Apr.

Article in English | MEDLINE | ID: mdl-35131608

ABSTRACT

BACKGROUND: The main screening parameter to monitor prostate cancer recurrence (PCR) after primary treatment is the serum concentration of prostate-specific antigen (PSA). In recent years, Ga-68-PSMA PET/CT has become an important method for additional diagnostics in patients with biochemical recurrence. PURPOSE: While Ga-68-PSMA PET/CT performs better, it is an expensive, invasive, and time-consuming examination. Therefore, in this study, we aim to employ modern multivariate Machine Learning (ML) methods on electronic health records (EHR) of prostate cancer patients to improve the prediction of imaging confirmed PCR (IPCR). METHODS: We retrospectively analyzed the clinical information of 272 patients who were examined using Ga-68-PSMA PET/CT. The PSA values ranged from 0 ng/mL to 2270.38 ng/mL with a median PSA level of 1.79 ng/mL. We performed a descriptive analysis using Logistic Regression. Additionally, we evaluated the predictive performance of Logistic Regression, Support Vector Machine, Gradient Boosting, and Random Forest. Finally, we assessed the importance of all features using Ensemble Feature Selection (EFS). RESULTS: The descriptive analysis found significant associations between IPCR and logarithmic PSA values as well as between IPCR and performed hormonal therapy. Our models were able to predict IPCR with an AUC score of 0.78 ± 0.13 (mean ± standard deviation) and a sensitivity of 0.997 ± 0.01. Features such as PSA, PSA doubling time, PSA velocity, hormonal therapy, radiation treatment, and injected activity show high importance for IPCR prediction using EFS. CONCLUSION: This study demonstrates the potential of incorporating a multitude of parameters into multivariate ML models to improve identification of non-recurring patients compared to the current focus on the main screening parameter (PSA). We showed that ML models are able to predict IPCR, detectable by Ga-68-PSMA PET/CT, and thereby pave the way for optimized early imaging and treatment.

12.

Federated Random Forests can improve local performance of predictive models for various healthcare applications.

Hauschild, Anne-Christin; Lemanczyk, Marta; Matschinske, Julian; Frisch, Tobias; Zolotareva, Olga; Holzinger, Andreas; Baumbach, Jan; Heider, Dominik.

Bioinformatics ; 38(8): 2278-2286, 2022 04 12.

Article in English | MEDLINE | ID: mdl-35139148

ABSTRACT

MOTIVATION: Limited data access has hindered the field of precision medicine from exploring its full potential, e.g. concerning machine learning and privacy and data protection rules. Our study evaluates the efficacy of federated Random Forests (FRF) models, focusing particularly on the heterogeneity within and between datasets. We addressed three common challenges: (i) number of parties, (ii) sizes of datasets and (iii) imbalanced phenotypes, evaluated on five biomedical datasets. RESULTS: The FRF outperformed the average local models and performed comparably to the data-centralized models trained on the entire data. With an increasing number of models and decreasing dataset size, the performance of local models decreases drastically. The performance of the FRF, however, does not decrease significantly. When combining datasets of different sizes, the FRF vastly improve compared to the average local models. By analyzing different class imbalances, we demonstrate that the FRF remain more robust and outperform the local models. Our results support that FRF overcome boundaries of clinical research and enable collaborations across institutes without violating privacy or legal regulations. Clinicians benefit from a vast collection of unbiased data aggregated from different geographic locations, demographics and other varying factors. They can build more generalizable models to make better clinical decisions, which will have relevance, especially for patients in rural areas and rare or geographically uncommon diseases, enabling personalized treatment. In combination with secure multi-party computation, federated learning has the power to revolutionize clinical practice by increasing the accuracy and robustness of healthcare AI and thus paving the way for precision medicine. AVAILABILITY AND IMPLEMENTATION: The implementation of the federated random forests can be found at https://featurecloud.ai/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
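The core idea of combining locally trained tree models without sharing raw data can be sketched as below. This is a deliberately simplified illustration, not the FeatureCloud implementation: it swaps full decision trees for one-feature threshold classifiers ("stumps"), and the names `train_stump`, `federated_predict`, and the three toy sites are assumptions.

```python
from collections import Counter

def train_stump(samples):
    """Fit a one-feature threshold classifier on local (value, label) pairs.

    Labels are 0 or 1. The sketch assumes every site holds both classes.
    """
    class0 = [v for v, y in samples if y == 0]
    class1 = [v for v, y in samples if y == 1]
    mean0 = sum(class0) / len(class0)
    mean1 = sum(class1) / len(class1)
    threshold = (mean0 + mean1) / 2
    high_label = 1 if mean1 > mean0 else 0
    return lambda v: high_label if v > threshold else 1 - high_label

def federated_predict(models, value):
    """Majority vote over the models contributed by all sites."""
    votes = Counter(model(value) for model in models)
    return votes.most_common(1)[0][0]

# Each site trains on its own data; only the fitted models leave the site.
site_a = [(1.0, 0), (2.0, 0), (8.0, 1), (9.0, 1)]
site_b = [(0.0, 0), (3.0, 0), (7.0, 1), (10.0, 1)]
site_c = [(2.0, 0), (4.0, 0), (6.0, 1), (12.0, 1)]
global_ensemble = [train_stump(site) for site in (site_a, site_b, site_c)]

prediction_high = federated_predict(global_ensemble, 5.5)
prediction_low = federated_predict(global_ensemble, 1.0)
```

With an odd number of sites the vote cannot tie in this binary setting; a real system would also need a tie-breaking rule and, as the abstract notes, can be combined with secure multi-party computation.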

Subject(s)

Privacy , Random Forest , Machine Learning , Precision Medicine , Delivery of Health Care

13.

Prediction of antimicrobial resistance based on whole-genome sequencing and machine learning.

Ren, Yunxiao; Chakraborty, Trinad; Doijad, Swapnil; Falgenhauer, Linda; Falgenhauer, Jane; Goesmann, Alexander; Hauschild, Anne-Christin; Schwengers, Oliver; Heider, Dominik.

Bioinformatics ; 38(2): 325-334, 2022 01 03.

Article in English | MEDLINE | ID: mdl-34613360

ABSTRACT

MOTIVATION: Antimicrobial resistance (AMR) is one of the biggest global problems threatening human and animal health. Rapid and accurate AMR diagnostic methods are thus urgently needed. However, traditional antimicrobial susceptibility testing (AST) is time-consuming, low throughput and viable only for cultivable bacteria. Machine learning methods may pave the way for automated AMR prediction based on genomic data of the bacteria. However, a comparison of different machine learning methods for the prediction of AMR based on different encodings and whole-genome sequencing data without prior knowledge remains to be done. RESULTS: In this study, we evaluated logistic regression (LR), support vector machine (SVM), random forest (RF) and convolutional neural network (CNN) for the prediction of AMR for the antibiotics ciprofloxacin, cefotaxime, ceftazidime and gentamicin. We demonstrate that these models can effectively predict AMR with label encoding, one-hot encoding and frequency matrix chaos game representation (FCGR encoding) on whole-genome sequencing data. We trained these models on a large AMR dataset and evaluated them on an independent public dataset. Generally, RFs and CNNs perform better than LR and SVM with AUCs up to 0.96. Furthermore, we were able to identify mutations that are associated with AMR for each antibiotic. AVAILABILITY AND IMPLEMENTATION: Source code for data preparation and model training is provided on GitHub (https://github.com/YunxiaoRen/ML-iAMR). SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
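Two of the encodings named in the abstract, label encoding and one-hot encoding, can be illustrated with a short sketch. The integer mapping (A=1, C=2, G=3, T=4, anything else 0) and the function names are assumptions chosen for illustration and may differ from the ML-iAMR code; FCGR encoding is not shown.

```python
NUCLEOTIDES = "ACGT"

def label_encode(sequence):
    """Map each base to an integer: A=1, C=2, G=3, T=4, unknown=0 (assumed scheme)."""
    return [NUCLEOTIDES.index(base) + 1 if base in NUCLEOTIDES else 0
            for base in sequence]

def one_hot_encode(sequence):
    """Map each base to a 4-element indicator vector; unknown bases become all zeros."""
    return [[1 if base == ref else 0 for ref in NUCLEOTIDES]
            for base in sequence]

labels = label_encode("ACGTN")
one_hot = one_hot_encode("AG")
```

Label encoding is compact but imposes an artificial order on the bases, whereas one-hot encoding avoids that order at the cost of four times as many features.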

Subject(s)

Anti-Bacterial Agents , Drug Resistance, Bacterial , Animals , Humans , Anti-Bacterial Agents/pharmacology , Drug Resistance, Bacterial/genetics , Ciprofloxacin , Machine Learning , Genomics , Bacteria/genetics

14.

Fractal construction of constrained code words for DNA storage systems.

Löchel, Hannah F; Welzel, Marius; Hattab, Georges; Hauschild, Anne-Christin; Heider, Dominik.

Nucleic Acids Res ; 50(5): e30, 2022 03 21.

Article in English | MEDLINE | ID: mdl-34908135

ABSTRACT

The use of complex biological molecules to solve computational problems is an emerging field at the interface between biology and computer science. There are two main categories in which biological molecules, especially DNA, are investigated as alternatives to silicon-based computer technologies. One is to use DNA as a storage medium, and the other is to use DNA for computing. Both strategies come with certain constraints. In the current study, we present a novel approach derived from chaos game representation for DNA to generate DNA code words that fulfill user-defined constraints, namely GC content, homopolymers, and undesired motifs, and thus, can be used to build codes for reliable DNA storage systems.
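The three constraint types named in the abstract (GC content, homopolymers, undesired motifs) can be expressed as a simple validity check on a candidate code word. The sketch below only verifies constraints and does not reproduce the chaos-game-based construction; the function names and the default thresholds are hypothetical.

```python
def gc_content(word):
    """Fraction of G and C bases in a non-empty DNA word."""
    return sum(base in "GC" for base in word) / len(word)

def longest_homopolymer(word):
    """Length of the longest run of identical consecutive bases."""
    longest = run = 1
    for previous, current in zip(word, word[1:]):
        run = run + 1 if current == previous else 1
        longest = max(longest, run)
    return longest

def satisfies_constraints(word, gc_range=(0.4, 0.6), max_run=3, forbidden=()):
    """Check GC content, homopolymer length, and absence of forbidden motifs."""
    low, high = gc_range
    return (low <= gc_content(word) <= high
            and longest_homopolymer(word) <= max_run
            and not any(motif in word for motif in forbidden))

ok_word = satisfies_constraints("ACGTACGT")
long_run = satisfies_constraints("AAAACGCG")
bad_motif = satisfies_constraints("ACGTACGT", forbidden=("GTAC",))
low_gc = satisfies_constraints("ATATATAT")
```

A generate-and-filter approach built on such a check scales poorly with word length, which is one motivation for constructing valid words directly.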

Subject(s)

Computational Biology/methods , DNA , Fractals

15.

Transfer learning compensates limited data, batch effects and technological heterogeneity in single-cell sequencing.

Park, Youngjun; Hauschild, Anne-Christin; Heider, Dominik.

NAR Genom Bioinform ; 3(4): lqab104, 2021 Dec.

Article in English | MEDLINE | ID: mdl-34805988

ABSTRACT

Tremendous advances in next-generation sequencing technology have enabled the accumulation of large amounts of omics data in various research areas over the past decade. However, study limitations due to small sample sizes, especially in rare disease clinical research, technological heterogeneity and batch effects limit the applicability of traditional statistics and machine learning analysis. Here, we present a meta-transfer learning approach to transfer knowledge from big data and reduce the search space in data with small sample sizes. Few-shot learning algorithms integrate meta-learning to overcome data scarcity and data heterogeneity by transferring molecular pattern recognition models from datasets of unrelated domains. We explore few-shot learning models with the large-scale public datasets TCGA (The Cancer Genome Atlas) and GTEx, and demonstrate their potential as pre-training datasets for other molecular pattern recognition tasks. Our results show that meta-transfer learning is very effective for datasets with a limited sample size. Furthermore, we show that our approach can transfer knowledge across technological heterogeneity, for example, from bulk cell to single-cell data. Our approach can overcome study size constraints, batch effects and technical limitations in analyzing single-cell data by leveraging existing bulk-cell sequencing data.

16.

Integrative Analysis of Next-Generation Sequencing for Next-Generation Cancer Research toward Artificial Intelligence.

Park, Youngjun; Heider, Dominik; Hauschild, Anne-Christin.

Cancers (Basel) ; 13(13), 2021 Jun 24.

Article in English | MEDLINE | ID: mdl-34202427

ABSTRACT

The rapid improvement of next-generation sequencing (NGS) technologies and their application in large-scale cohorts in cancer research led to common challenges of big data. It opened a new research area incorporating systems biology and machine learning. As large-scale NGS data accumulated, sophisticated data analysis methods became indispensable. In addition, NGS data have been integrated with systems biology to build better predictive models to determine the characteristics of tumors and tumor subtypes. Therefore, various machine learning algorithms were introduced to identify underlying biological mechanisms. In this work, we review novel technologies developed for NGS data analysis, and we describe how these computational methodologies integrate systems biology and omics data. Subsequently, we discuss how deep neural networks outperform other approaches, the potential of graph neural networks (GNN) in systems biology, and the limitations in NGS biomedical research. To reflect on the various challenges and corresponding computational solutions, we will discuss the following three topics: (i) molecular characteristics, (ii) tumor heterogeneity, and (iii) drug discovery. We conclude that machine learning and network-based approaches can add valuable insights and build highly accurate models. However, a well-informed choice of learning algorithm and biological network information is crucial for the success of each specific research question.

17.

Fostering reproducibility, reusability, and technology transfer in health informatics.

Hauschild, Anne-Christin; Eick, Lisa; Wienbeck, Joachim; Heider, Dominik.

iScience ; 24(7): 102803, 2021 Jul 23.

Article in English | MEDLINE | ID: mdl-34296072

ABSTRACT

Computational methods can transform healthcare. In particular, health informatics with artificial intelligence has shown tremendous potential when applied in various fields of medical research and has opened a new era for precision medicine. The development of reusable biomedical software for research or clinical practice is time-consuming and requires rigorous compliance with quality requirements as defined by international standards. However, research projects rarely implement such measures, hindering smooth technology transfer into the research community or manufacturers as well as reproducibility and reusability. Here, we present a guideline for quality management systems (QMS) for academic organizations incorporating the essential components while confining the requirements to an easily manageable effort. It provides a starting point to implement a QMS tailored to specific needs effortlessly and greatly facilitates technology transfer in a controlled manner, thereby supporting reproducibility and reusability. Ultimately, the emerging standardized workflows can pave the way for an accelerated deployment in clinical practice.

18.

A large-scale comparative study on peptide encodings for biomedical classification.

Spänig, Sebastian; Mohsen, Siba; Hattab, Georges; Hauschild, Anne-Christin; Heider, Dominik.

NAR Genom Bioinform ; 3(2): lqab039, 2021 Jun.

Article in English | MEDLINE | ID: mdl-34046590

ABSTRACT

Owing to the great variety of distinct peptide encodings, choosing one for a given biomedical classification task is challenging. Researchers have to determine encodings capable of representing underlying patterns as numerical input for the subsequent machine learning. A general guideline is lacking in the literature; thus, we present here the first large-scale comprehensive study to investigate the performance of a wide range of encodings on multiple datasets from different biomedical domains. For the sake of completeness, we added additional sequence- and structure-based encodings. In particular, we collected 50 biomedical datasets and defined a fixed parameter space for 48 encoding groups, leading to a total of 397 700 encoded datasets. Our results demonstrate that none of the encodings is superior for all biomedical domains. Nevertheless, some encodings often outperform others, thus reducing the initial encoding selection substantially. Our work enables researchers to objectively compare novel encodings to the state of the art. Our findings pave the way for a more sophisticated encoding optimization, for example, as part of automated machine learning pipelines. The work presented here is implemented as a large-scale, end-to-end workflow designed for easy reproducibility and extensibility. All standardized datasets and results are available for download to comply with FAIR standards.

19.

Genome-wide analysis suggests the importance of vascular processes and neuroinflammation in late-life antidepressant response.

Marshe, Victoria S; Maciukiewicz, Malgorzata; Hauschild, Anne-Christin; Islam, Farhana; Qin, Li; Tiwari, Arun K; Sibille, Etienne; Blumberger, Daniel M; Karp, Jordan F; Flint, Alastair J; Turecki, Gustavo; Lam, Raymond W; Milev, Roumen V; Frey, Benicio N; Rotzinger, Susan; Foster, Jane A; Kennedy, Sidney H; Kennedy, James L; Mulsant, Benoit H; Reynolds, Charles F; Lenze, Eric J; Müller, Daniel J.

Transl Psychiatry ; 11(1): 127, 2021 02 15.

Article in English | MEDLINE | ID: mdl-33589590

ABSTRACT

Antidepressant outcomes in older adults with depression are poor, possibly because of comorbidities such as cerebrovascular disease. Therefore, we leveraged multiple genome-wide approaches to understand the genetic architecture of antidepressant response. Our sample included 307 older adults (≥60 years) with current major depression, treated with venlafaxine extended-release for 12 weeks. A standard genome-wide association study (GWAS) was conducted for post-treatment remission status, followed by in silico biological characterization of associated genes, as well as polygenic risk scoring for depression, neurodegenerative and cerebrovascular disease. The top-associated variants for remission status and percentage symptom improvement were PIEZO1 rs12597726 (OR = 0.33 [0.21, 0.51], p = 1.42 × 10⁻⁶) and intergenic rs6916777 (Beta = 14.03 [8.47, 19.59], p = 1.25 × 10⁻⁶), respectively. Pathway analysis revealed significant contributions from genes involved in the ubiquitin-proteasome system, which regulates intracellular protein degradation and has implications for inflammation, as well as atherosclerotic cardiovascular disease (n = 25 of 190 genes, p = 8.03 × 10⁻⁶, FDR-corrected p = 0.01). Given the polygenicity of complex outcomes such as antidepressant response, we also explored 11 polygenic risk scores associated with risk for Alzheimer's disease and stroke. Of the 11 scores, risk for cardioembolic stroke was the second-best predictor of non-remission, after being male (Accuracy = 0.70 [0.59, 0.79], Sensitivity = 0.72, Specificity = 0.67; p = 2.45 × 10⁻⁴). Although our findings did not reach genome-wide significance, they point to previously implicated mechanisms and provide support for the roles of vascular and inflammatory pathways in late-life depression (LLD). Overall, significant enrichment of genes involved in protein degradation pathways that may be impaired, as well as the predictive capacity of risk for cardioembolic stroke, support a link between late-life depression remission and risk for vascular dysfunction.

Subject(s)

Depressive Disorder, Major , Genome-Wide Association Study , Aged , Antidepressive Agents/therapeutic use , Depressive Disorder, Major/drug therapy , Depressive Disorder, Major/genetics , Humans , Ion Channels , Male , Multifactorial Inheritance , Venlafaxine Hydrochloride/therapeutic use

20.

Computational strategies to combat COVID-19: useful tools to accelerate SARS-CoV-2 and coronavirus research.

Hufsky, Franziska; Lamkiewicz, Kevin; Almeida, Alexandre; Aouacheria, Abdel; Arighi, Cecilia; Bateman, Alex; Baumbach, Jan; Beerenwinkel, Niko; Brandt, Christian; Cacciabue, Marco; Chuguransky, Sara; Drechsel, Oliver; Finn, Robert D; Fritz, Adrian; Fuchs, Stephan; Hattab, Georges; Hauschild, Anne-Christin; Heider, Dominik; Hoffmann, Marie; Hölzer, Martin; Hoops, Stefan; Kaderali, Lars; Kalvari, Ioanna; von Kleist, Max; Kmiecinski, Renó; Kühnert, Denise; Lasso, Gorka; Libin, Pieter; List, Markus; Löchel, Hannah F; Martin, Maria J; Martin, Roman; Matschinske, Julian; McHardy, Alice C; Mendes, Pedro; Mistry, Jaina; Navratil, Vincent; Nawrocki, Eric P; O'Toole, Áine Niamh; Ontiveros-Palacios, Nancy; Petrov, Anton I; Rangel-Pineros, Guillermo; Redaschi, Nicole; Reimering, Susanne; Reinert, Knut; Reyes, Alejandro; Richardson, Lorna; Robertson, David L; Sadegh, Sepideh; Singer, Joshua B.

Brief Bioinform ; 22(2): 642-663, 2021 03 22.

Article in English | MEDLINE | ID: mdl-33147627

ABSTRACT

SARS-CoV-2 (severe acute respiratory syndrome coronavirus 2) is a novel virus of the family Coronaviridae. The virus causes the infectious disease COVID-19. The biology of coronaviruses has been studied for many years. However, bioinformatics tools designed explicitly for SARS-CoV-2 have only recently been developed as a rapid reaction to the need for fast detection, understanding and treatment of COVID-19. To control the ongoing COVID-19 pandemic, it is of utmost importance to gain insight into the evolution and pathogenesis of the virus. In this review, we cover bioinformatics workflows and tools for the routine detection of SARS-CoV-2 infection, the reliable analysis of sequencing data, the tracking of the COVID-19 pandemic and evaluation of containment measures, the study of coronavirus evolution, the discovery of potential drug targets and the development of therapeutic strategies. For each tool, we briefly describe its use case and how it advances research specifically for SARS-CoV-2. All tools are free to use and available online, either through web applications or public code repositories. Contact: evbc@unj-jena.de.

Subject(s)

COVID-19/prevention & control , Computational Biology , SARS-CoV-2/isolation & purification , Biomedical Research , COVID-19/epidemiology , COVID-19/virology , Genome, Viral , Humans , Pandemics , SARS-CoV-2/genetics
